[TRTLLM-13543][feat] WideEP FT: add EPLB mask-only reconfigure (1b.1) by chienchunhung · Pull Request #15525 · NVIDIA/TensorRT-LLM

chienchunhung · 2026-06-22T21:25:11Z

Summary

Add mask-only EPLB reconfiguration in C++, nanobind, and Python.
Make dead-rank slots unreachable and preserve the mask across later dynamic EPLB updates.
Fail closed if masking would leave any expert without a live replica.
Reject direct reconfiguration while an iteration is active.

Scope

Standalone on main; no in-flight PR dependency.
HBM/capacity admission remains an integration-layer gate before calling this API.

Validation

git diff --check upstream/main..HEAD
python -m py_compile tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py tests/unittest/_torch/modules/test_moe_load_balancer.py

Summary by CodeRabbit

New Features
- Introduced capability to reconfigure MOE load balancing when compute ranks become unavailable, with automatic validation to ensure all experts retain at least one active replica.
Tests
- Added tests for mask-only reconfiguration scenarios including replica validation, active iteration protection, and edge case handling.

chienchunhung · 2026-06-22T22:17:52Z

/bot run

tensorrt-cicd · 2026-06-22T22:24:11Z

PR_Github #55089 [ run ] triggered by Bot. Commit: 7651902 Link to invocation

tensorrt-cicd · 2026-06-22T23:45:20Z

PR_Github #55089 [ run ] completed with state FAILURE. Commit: 7651902
/LLM/main/L0_MergeRequest_PR pipeline #44075 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

coderabbitai · 2026-06-23T05:17:33Z

📝 Walkthrough

Walkthrough

Adds reconfigureMaskOnly(deadRanks) to MoeLoadBalancer to reconfigure expert placement by masking dead ranks without full rebalancing. The C++ placement core (doReplication, doPlacement) gains an optional dead-rank mask parameter; iteration lifecycle tracking (mIterActive) is introduced; and matching Python bindings and unit tests are added.

Changes

MoeLoadBalancer Mask-Only Reconfiguration

Layer / File(s)	Summary
Public API contracts and new member declarations `cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.h`	Adds `reconfigureMaskOnly`, three `SingleLayerMoeLoadBalancer` helpers, `getDeadRankMaskSnapshot`, `mIterActive` boolean, `mDeadRankMask`/`mDeadRankMaskMutex` members, and updates `doReplication`/`doPlacement` free-function signatures to accept an optional `deadRankMask` pointer.
Dead-rank mask support in `doReplication` and `doPlacement` `cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp`	Introduces `isRankMasked`/`getActiveSlotCount` utilities. `doReplication` computes `totalSlotCount` from active ranks and checks a replica floor. `doPlacement` pre-fills all `rankExpertIds` to `-1`, seeds the priority queue only with unmasked ranks, and reworks GPU slot construction to skip invalid slots and assert assigned counts.
`SingleLayerMoeLoadBalancer` and `MoeLoadBalancer::reconfigureMaskOnly` implementation `cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp`	Per-layer helpers compute surviving replica counts, validate every expert retains at least one replica, and trigger CPU/GPU placement. Top-level `reconfigureMaskOnly` locks against concurrent iterations, builds the candidate mask, waits for layer updates, validates all layers, stores the mask, and reconfigures each layer. `startIter`/`endIter` now maintain `mIterActive`; weight-update routines snapshot and forward the mask.
nanobind and Python wrapper `cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp`, `tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py`	nanobind exposes `reconfigure_mask_only(dead_ranks)` with GIL release. The Python `MoeLoadBalancer.reconfigure_mask_only` method guards against active iteration via `self.in_iter` and delegates to `load_balancer_impl`.
C++ and Python unit tests `cpp/tests/unit_tests/runtime/moeLoadBalancerTest.cpp`, `tests/unittest/_torch/modules/test_moe_load_balancer.py`	Three C++ tests cover slot invalidation for dead ranks, rejection during active iteration, and rejection when masking the last replica. Two Python tests cover lifecycle forwarding to the mock and `RuntimeError` on active iteration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: adding mask-only reconfiguration to Expert Pipeline Load Balancer (EPLB) for WideEP FT, which aligns with the PR's primary functionality.
Description check	✅ Passed	The PR description provides a clear summary of changes, scope, and validation steps. However, it lacks explicit test coverage documentation as required by the template.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp (1)

153-175: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Make the mask-dependent replica-count invariant a runtime check.

Line 159 makes totalSlotCount depend on the active mask, but Line 175 only checks the replica-count sum with assert. In release builds, a stale or mismatched expertReplicaCount can drain pq and make pq.top() undefined behavior. Use TLLM_CHECK_WITH_INFO here.

Proposed fix

-    assert(static_cast<int>(allReplicas.size()) == totalSlotCount);
+    TLLM_CHECK_WITH_INFO(static_cast<int>(allReplicas.size()) == totalSlotCount,
+        "Replica count sum (%ld) must match active slot count (%d)", static_cast<long>(allReplicas.size()),
+        totalSlotCount);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp` around lines
153 - 175, The assertion check at the end of the function that compares
allReplicas.size() to totalSlotCount is only enforced in debug builds. Since
totalSlotCount depends on the active mask and can mismatch with
expertReplicaCount, this invariant violation could cause undefined behavior in
release builds. Replace the assert statement with TLLM_CHECK_WITH_INFO to ensure
this critical runtime check is performed in all build configurations, and
provide a descriptive error message that explains the mismatch between the
expected slot count and actual replica count.

🧹 Nitpick comments (2)

tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py (1)
992-992: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use 3.10-native typing and an explicit return type in the new API signature.

Align this new method with repo typing rules by switching to list[int] and annotating -> None.
♻️ Proposed fix
-    def reconfigure_mask_only(self, dead_ranks: List[int]):
+    def reconfigure_mask_only(self, dead_ranks: list[int]) -> None:
As per coding guidelines, "Always annotate functions with return types" and "Prefer using the built-in types list, dict, and tuple to the legacy typing.List, typing.Dict, and typing.Tuple."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py` at line 992, The
reconfigure_mask_only method signature does not conform to the repository's
typing standards. Update the method signature by replacing the legacy List[int]
type hint with the Python 3.10-native list[int], and add an explicit return type
annotation of -> None to the method definition to align with the coding
guidelines that require all functions to have annotated return types and prefer
built-in generic types over typing module imports.
Source: Coding guidelines
tests/unittest/_torch/modules/test_moe_load_balancer.py (1)
272-304: 📐 Maintainability & Code Quality | 🔵 Trivial

Coverage assessment: sufficient for this PR, with one optional follow-up outside it.

Coverage is sufficient here: tests/unittest/_torch/modules/test_moe_load_balancer.py validates Python forwarding and active-iteration rejection, and cpp/tests/unit_tests/runtime/moeLoadBalancerTest.cpp validates dead-rank masking plus fail-closed behavior. Optional follow-up outside this PR: add one non-mocked Python integration case in tests/unittest/_torch/modules/test_moe_load_balancer.py to verify backend fail-closed exceptions surface end-to-end.

As per path instructions, "tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/test_moe_load_balancer.py` around lines 272 -
304, The current test coverage in test_moe_load_balancer.py is sufficient for
this PR, but there is an optional follow-up suggestion to add one non-mocked
Python integration test case to the test_moe_load_balancer.py file that verifies
backend fail-closed exceptions surface end-to-end. This test should exercise the
MoeLoadBalancer with actual backend behavior (not mocked) to ensure that
exceptions from fail-closed scenarios are properly propagated and not swallowed
or incorrectly handled during the MoeLoadBalancer lifecycle operations.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp`:
- Around line 1127-1134: The mDeadRankMask is being updated before all layers
are reconfigured with the candidateDeadRankMask, which can expose an
inconsistent state if layer->reconfigureMaskOnly() throws an exception. Move the
entire lock_guard block that updates mDeadRankMask to after the for loop
iterating over mLayers completes successfully. This ensures the shared mask is
only published once all per-layer reconfiguration has succeeded, preventing
other threads from observing a partially-applied mask state.
- Around line 60-78: The `getActiveSlotCount` function does not validate that
the `deadRankMask` vector has the correct size before using it. Currently, if
the mask is shorter than `metaInfo.epSize`, ranks beyond the vector size are
treated as active in the `isRankMasked` check, which can leave dead-rank slots
reachable. Add validation at the beginning of `getActiveSlotCount` to ensure
that if `deadRankMask` is provided (not nullptr), its size must equal
`metaInfo.epSize`. If the size does not match, handle the error appropriately
(such as asserting or returning an error indicator).

---

Outside diff comments:
In `@cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp`:
- Around line 153-175: The assertion check at the end of the function that
compares allReplicas.size() to totalSlotCount is only enforced in debug builds.
Since totalSlotCount depends on the active mask and can mismatch with
expertReplicaCount, this invariant violation could cause undefined behavior in
release builds. Replace the assert statement with TLLM_CHECK_WITH_INFO to ensure
this critical runtime check is performed in all build configurations, and
provide a descriptive error message that explains the mismatch between the
expected slot count and actual replica count.

---

Nitpick comments:
In `@tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py`:
- Line 992: The reconfigure_mask_only method signature does not conform to the
repository's typing standards. Update the method signature by replacing the
legacy List[int] type hint with the Python 3.10-native list[int], and add an
explicit return type annotation of -> None to the method definition to align
with the coding guidelines that require all functions to have annotated return
types and prefer built-in generic types over typing module imports.

In `@tests/unittest/_torch/modules/test_moe_load_balancer.py`:
- Around line 272-304: The current test coverage in test_moe_load_balancer.py is
sufficient for this PR, but there is an optional follow-up suggestion to add one
non-mocked Python integration test case to the test_moe_load_balancer.py file
that verifies backend fail-closed exceptions surface end-to-end. This test
should exercise the MoeLoadBalancer with actual backend behavior (not mocked) to
ensure that exceptions from fail-closed scenarios are properly propagated and
not swallowed or incorrectly handled during the MoeLoadBalancer lifecycle
operations.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8ea43521-47c2-40dc-97de-03fb29f90a56

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed7ce4 and 7651902.

📒 Files selected for processing (6)

cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp
cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp
cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.h
cpp/tests/unit_tests/runtime/moeLoadBalancerTest.cpp
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
tests/unittest/_torch/modules/test_moe_load_balancer.py

chienchunhung · 2026-06-23T17:05:05Z

/bot run

chienchunhung · 2026-06-23T17:07:01Z

/bot run

tensorrt-cicd · 2026-06-23T17:12:06Z

PR_Github #55289 [ run ] triggered by Bot. Commit: 088dd5c Link to invocation

tensorrt-cicd · 2026-06-23T17:12:41Z

PR_Github #55290 [ run ] triggered by Bot. Commit: 088dd5c Link to invocation

tensorrt-cicd · 2026-06-23T17:12:43Z

PR_Github #55289 [ run ] completed with state ABORTED. Commit: 088dd5c

Link to invocation

tensorrt-cicd · 2026-06-23T20:55:49Z

PR_Github #55290 [ run ] completed with state FAILURE. Commit: 088dd5c
/LLM/main/L0_MergeRequest_PR pipeline #44241 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-23T23:12:22Z

/bot run --disable-fail-fast --stage-list "L40S-PyTorch-1,H100_PCIe-CPP-1"

tensorrt-cicd · 2026-06-23T23:17:45Z

PR_Github #55343 [ run ] triggered by Bot. Commit: 088dd5c Link to invocation

tensorrt-cicd · 2026-06-24T00:51:41Z

PR_Github #55343 [ run ] completed with state SUCCESS. Commit: 088dd5c
/LLM/main/L0_MergeRequest_PR pipeline #44292 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

chienchunhung · 2026-06-24T05:47:50Z

/bot run --disable-fail-fast

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-06-24T20:54:00Z

/bot run

tensorrt-cicd · 2026-06-24T21:02:11Z

PR_Github #55581 [ run ] triggered by Bot. Commit: 5f391a6 Link to invocation

tensorrt-cicd · 2026-06-24T23:00:38Z

PR_Github #55581 [ run ] completed with state FAILURE. Commit: 5f391a6
/LLM/main/L0_MergeRequest_PR pipeline #44499 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chienchunhung · 2026-06-24T23:19:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-24T23:25:40Z

PR_Github #55614 [ run ] triggered by Bot. Commit: 5f391a6 Link to invocation

tensorrt-cicd · 2026-06-25T06:02:57Z

PR_Github #55614 [ run ] completed with state FAILURE. Commit: 5f391a6
/LLM/main/L0_MergeRequest_PR pipeline #44532 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned chienchunhung Jun 22, 2026

chienchunhung mentioned this pull request Jun 22, 2026

[None][feat] WideEP FT: add EPLB mask-only reconfigure chienchunhung/TensorRT-LLM#4

Closed

chienchunhung force-pushed the WideEP-FT/1b.1-reconfigure-masking branch 2 times, most recently from ee6dddf to 7651902 Compare June 22, 2026 22:15

chienchunhung changed the title ~~[TRTLLM-12557][feat] WideEP FT: add EPLB mask-only reconfigure (1b.1)~~ [TRTLLM-13543][feat] WideEP FT: add EPLB mask-only reconfigure (1b.1) Jun 22, 2026

chienchunhung requested a review from dongxuy04 June 23, 2026 02:46

chienchunhung marked this pull request as ready for review June 23, 2026 05:08

chienchunhung requested a review from a team as a code owner June 23, 2026 05:08

chienchunhung requested a review from yuxianq June 23, 2026 05:08

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp

Comment thread cpp/tensorrt_llm/runtime/moeLoadBalancer/moeLoadBalancer.cpp Outdated

chienchunhung force-pushed the WideEP-FT/1b.1-reconfigure-masking branch from 7651902 to 801713d Compare June 23, 2026 17:04

chienchunhung force-pushed the WideEP-FT/1b.1-reconfigure-masking branch from 801713d to 088dd5c Compare June 23, 2026 17:06

chienchunhung added 2 commits June 24, 2026 13:53

[None][feat] WideEP FT: add EPLB mask-only reconfigure

a3ac34c

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

fix: address EPLB mask review feedback

5f391a6

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung force-pushed the WideEP-FT/1b.1-reconfigure-masking branch from 088dd5c to 5f391a6 Compare June 24, 2026 20:53

Uh oh!

Conversation

chienchunhung commented Jun 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope

Validation

Summary by CodeRabbit

Uh oh!

chienchunhung commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

chienchunhung commented Jun 23, 2026

Uh oh!

chienchunhung commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

chienchunhung commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

chienchunhung commented Jun 24, 2026

Uh oh!

chienchunhung commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

chienchunhung commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chienchunhung commented Jun 22, 2026 •

edited by coderabbitai Bot

Loading